{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import time\n",
    "import warnings; warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "from Bio.Seq import Seq\n",
    "from Bio.Alphabet import generic_dna\n",
    "from functools import partial\n",
    "from sklearn.cluster import DBSCAN, MiniBatchKMeans\n",
    "from sklearn.neighbors import BallTree\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "from icing.core import distances; reload(distances)\n",
    "from icing.core.distances import *\n",
    "from icing.externals.DbCore import parseAllele, gene_regex, junction_re\n",
    "from icing.utils import io\n",
    "from icing import inference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ICING tutorial\n",
    "<hr>\n",
    "ICING is a IG clonotype inference library developed in Python.\n",
    "\n",
    "<font color=\"red\"><b>NB:</b></font> This is <font color=\"red\"><b>NOT</b></font> a quickstart guide for ICING. This intended as a detailed tutorial on how ICING works internally. If you're only interested into using ICING, please refer to the [Quickstart Manual on github](https://github.com/slipguru/icing#quickstart), or the <font color=\"blue\">Quickstart section at the end of this notebook</font>.\n",
    "\n",
    "ICING needs as input a file (TAB-delimited or CSV) which contains, in each row, a particular sequence.\n",
    "The format used is the same as returned by Change-O's `MakeDb.py` script, which, starting from a IMGT results, it builds a single file with all the information extracted from IMGT starting from the RAW fasta sequences.\n",
    "\n",
    "## 0. Data loading\n",
    "\n",
    "Load the dataset into a single `pandas` dataframe called '`df`'.\n",
    "\n",
    "The dataset MUST CONTAIN at least the following columns (NOT case-sensitive):\n",
    "- SEQUENCE_ID\n",
    "- V_CALL\n",
    "- J_CALL\n",
    "- JUNCTION\n",
    "- MUT (only if correct is True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "db_file = '../examples/data/clones_100.100.tab'\n",
    "\n",
    "# dialect=\"excel\" for CSV or XLS files\n",
    "# for computational reasons, let's limit the dataset to the first 1000 sequences\n",
    "X = io.load_dataframe(db_file, dialect=\"excel-tab\")[:1000] \n",
    "\n",
    "# turn the following off if data are real\n",
    "# otherwise, assume that the \"SEQUENCE_ID\" field is composed as\n",
    "# \"[db]_[extension]_[id]_[id-true-clonotype]_[other-info]\"\n",
    "# See the example file for the format of the input.\n",
    "X['true_clone'] = [x[3] for x in X.sequence_id.str.split('_')]  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Preprocessing step: data shrinking\n",
    "\n",
    "Specially in CLL patients, most of the input sequences have the same V genes AND junction. In this case, it is possible to remove such sequences from the analysis (we just need to remember them after.)\n",
    "In other words, we can collapse repeated sequences into a single one, which will weight as high as the number of sequences it represents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# group by junction and v genes\n",
    "groups = X.groupby([\"v_gene_set_str\", \"junc\"]).groups.values()\n",
    "idxs = np.array([elem[0] for elem in groups])  # take one of them\n",
    "weights = np.array([len(elem) for elem in groups])  # assign its weight"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. High-level group inference\n",
    "\n",
    "The number of sequences at this point may be still very high, in particular when IGs are mutated and there is not much replication. However, we rely on the fact that IG similarity is mainly constrained on their junction length. Therefore, we infer high-level groups based on their junction lengths.\n",
    "\n",
    "This is a fast and efficient step. Also, by exploiting `MiniBatchKMeans`, we can specify an upperbound on the number of clusters we want to obtain. However, contrary to the standard `KMeans` algorithm, in this case some clusters may vanish. If one is expected to have related IGs with very different junction lengths, however, it is reasonable to specify a low value of clusters.\n",
    "\n",
    "Keep in mind, however, that a low number of clusters correspond to higher computational workload of the method in the next phases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_clusters = 50\n",
    "\n",
    "X_all = idxs.reshape(-1,1)\n",
    "kmeans = MiniBatchKMeans(n_init=100, n_clusters=min(n_clusters, X_all.shape[0]))\n",
    "\n",
    "lengths = X['junction_length'].values\n",
    "kmeans.fit(lengths[idxs].reshape(-1,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Fine-grained group inference\n",
    "\n",
    "Now we have higih-level groups of IGs we have to extract clonotypes from.\n",
    "Divide the dataset based on the labels extracted from `MiniBatchKMeans`.\n",
    "For each one of the cluster, find clonotypes contained in it using `DBSCAN`.\n",
    "\n",
    "This algorithm allows us to use a custom metric between IGs.\n",
    "\n",
    "[<font color='blue'><b>ADVANCED</b></font>] To develop a custom metric, see the format of `icing.core.distances.distance_dataframe`. If you use a custom function, then you only need to put it as parameter of DBSCAN metric. Note that `partial` is required if the metric has more than 2 parameters. To be a valid metric for DBSCAN, the function must take ONLY two params (the two elements to compare). For this reason, the other arguments are pre-computed with `partial` in the following example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dbscan = DBSCAN(min_samples=20, n_jobs=-1, algorithm='brute', eps=0.2,\n",
    "                metric=partial(distance_dataframe, X, \n",
    "                    junction_dist=distances.StringDistance(model='ham'),\n",
    "                    correct=True, tol=0))\n",
    "\n",
    "dbscan_labels = np.zeros_like(kmeans.labels_).ravel()\n",
    "for label in np.unique(kmeans.labels_):\n",
    "    idx_row = np.where(kmeans.labels_ == label)[0]\n",
    "    \n",
    "    X_idx = idxs[idx_row].reshape(-1,1).astype('float64')\n",
    "    weights_idx = weights[idx_row]\n",
    "    \n",
    "    if idx_row.size == 1:\n",
    "        db_labels = np.array([0])\n",
    "    \n",
    "    db_labels = dbscan.fit_predict(X_idx, sample_weight=weights_idx)\n",
    "    \n",
    "    if len(dbscan.core_sample_indices_) < 1:\n",
    "        db_labels[:] = 0\n",
    "        \n",
    "    if -1 in db_labels:\n",
    "        # this means that DBSCAN found some IG as noise. We choose to assign to the nearest cluster\n",
    "        balltree = BallTree(\n",
    "            X_idx[dbscan.core_sample_indices_],\n",
    "            metric=dbscan.metric)\n",
    "        noise_labels = balltree.query(\n",
    "            X_idx[db_labels == -1], k=1, return_distance=False).ravel()\n",
    "        # get labels for core points, then assign to noise points based\n",
    "        # on balltree\n",
    "        dbscan_noise_labels = db_labels[\n",
    "            dbscan.core_sample_indices_][noise_labels]\n",
    "        db_labels[db_labels == -1] = dbscan_noise_labels\n",
    "    \n",
    "    # hopefully, there are no noisy samples at this time\n",
    "    db_labels[db_labels > -1] = db_labels[db_labels > -1] + np.max(dbscan_labels) + 1\n",
    "    dbscan_labels[idx_row] = db_labels  # + np.max(dbscan_labels) + 1\n",
    "\n",
    "labels = dbscan_labels\n",
    "\n",
    "# new part: put together the labels\n",
    "labels_ext = np.zeros(X.shape[0], dtype=int)\n",
    "labels_ext[idxs] = labels\n",
    "for i, list_ in enumerate(groups):\n",
    "    labels_ext[list_] = labels[i]\n",
    "labels = labels_ext"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quickstart\n",
    "<hr>\n",
    "\n",
    "All of the above-mentioned steps are integrated in ICING with a simple call to the class `inference.ICINGTwoStep`.\n",
    "The following is an example of a working script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "db_file = '../examples/data/clones_100.100.tab'\n",
    "correct = True\n",
    "tolerance = 0\n",
    "\n",
    "X = io.load_dataframe(db_file)[:1000]\n",
    "\n",
    "# turn the following off if data are real\n",
    "X['true_clone'] = [x[3] for x in X.sequence_id.str.split('_')] \n",
    "true_clones = LabelEncoder().fit_transform(X.true_clone.values)\n",
    "\n",
    "ii = inference.ICINGTwoStep(\n",
    "    model='nt', eps=0.2, method='dbscan', verbose=True,\n",
    "    kmeans_params=dict(n_init=100, n_clusters=20),\n",
    "    dbscan_params=dict(min_samples=20, n_jobs=-1, algorithm='brute',\n",
    "            metric=partial(distance_dataframe, X, **dict(\n",
    "                junction_dist=StringDistance(model='ham'),\n",
    "                correct=correct, tol=tolerance))))\n",
    "\n",
    "tic = time.time()\n",
    "labels = ii.fit_predict(X)\n",
    "tac = time.time() - tic\n",
    "print(\"\\nElapsed time: %.1fs\" % tac)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to save the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X['icing_clones (%s)' % ('_'.join(('StringDistance', str(eps), '0', 'corr' if correct else  'nocorr',\n",
    "                                    \"%.4f\" % tac)))] = labels\n",
    "\n",
    "X.to_csv(db_file.split('/')[-1] + '_icing.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How is the result?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "true_clones = LabelEncoder().fit_transform(X.true_clone.values)\n",
    "\n",
    "print \"FMI: %.5f\" % (metrics.fowlkes_mallows_score(true_clones, labels))\n",
    "print \"ARI: %.5f\" % (metrics.adjusted_rand_score(true_clones, labels))\n",
    "print \"AMI: %.5f\" % (metrics.adjusted_mutual_info_score(true_clones, labels))\n",
    "print \"NMI: %.5f\" % (metrics.normalized_mutual_info_score(true_clones, labels))\n",
    "print \"Hom: %.5f\" % (metrics.homogeneity_score(true_clones, labels))\n",
    "print \"Com: %.5f\" % (metrics.completeness_score(true_clones, labels))\n",
    "print \"Vsc: %.5f\" % (metrics.v_measure_score(true_clones, labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is it better or worse than the result with everyone at the same time?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = dbscan.fit_predict(np.arange(X.shape[0]).reshape(-1, 1))\n",
    "\n",
    "print \"FMI: %.5f\" % metrics.fowlkes_mallows_score(true_clones, labels)\n",
    "print \"ARI: %.5f\" % (metrics.adjusted_rand_score(true_clones, labels))\n",
    "print \"AMI: %.5f\" % (metrics.adjusted_mutual_info_score(true_clones, labels))\n",
    "print \"NMI: %.5f\" % (metrics.normalized_mutual_info_score(true_clones, labels))\n",
    "print \"Hom: %.5f\" % (metrics.homogeneity_score(true_clones, labels))\n",
    "print \"Com: %.5f\" % (metrics.completeness_score(true_clones, labels))\n",
    "print \"Vsc: %.5f\" % (metrics.v_measure_score(true_clones, labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This should be the same (or worse). We reduced the computational workload while maintaining or improving the result we would obtain without the step 1 and 2."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
